Challenges in the Fusion of Video and Audio for Robust Speech Recognition
Abstract
As speech recognizers become more robust, they are widely accepted as an essential component of human-computer interaction. State-of-the-art speaker-independent speech recognizers exist with word recognition error rates below 10%. To achieve even higher and more robust recognition performance, multi-modal speech recognition techniques that combine video and audio information can be used. Speech reading, the video portion of a bimodal speech recognizer, introduces not only the additional computational cost of video processing, but also challenges in the design of the integrated audio-video recognizer.
Similar resources
An Information-Theoretic Discussion of Convolutional Bottleneck Features for Robust Speech Recognition
Convolutional Neural Networks (CNNs) have shown their performance in speech recognition systems for feature extraction and acoustic modeling. In addition, CNNs have been used for robust speech recognition, and competitive results have been reported. The Convolutive Bottleneck Network (CBN) is a kind of CNN that has a bottleneck layer among its fully connected layers. The bottleneck fea...
Improving the performance of MFCC for Persian robust speech recognition
Mel-frequency cepstral coefficients (MFCCs) are the most widely used features in speech recognition, but they are very sensitive to noise. In this paper, to achieve satisfactory performance in Automatic Speech Recognition (ASR) applications, we introduce a new noise-robust set of MFCC vectors estimated through the following steps. First, spectral mean normalization is a pre-processing step that applies to t...
Audiovisual Information Fusion in Human-Computer Interfaces and Intelligent Environments: A Survey
Microphones and cameras have been extensively used to observe and detect human activity and to facilitate natural modes of interaction between humans and intelligent systems. The human brain processes the audio and video modalities, extracting complementary and robust information from them. Intelligent systems with audio-visual sensors should be capable of achieving similar goals. The audio-visual i...
Improved Speech Recognition using Adaptive Audio-visual Fusion via a Stochastic Secondary Classifier
The adaptive fusion of video and audio is one of the fundamental pursuits of audio-visual speech recognition (AVSR). In this paper, the use of a high-dimensional secondary classifier on the word likelihood scores from both the audio and video modalities is investigated for the purposes of adaptive fusion. Results are presented that lie above or equal to the boundary of catastrophic fusion acros...
Adaptive Audio-visual Speech Recognition in the Presence of Audio and Video Distortions
Audio-visual speech recognition leads to significant improvements compared to pure audio recognition, especially when the audio signal is corrupted by noise. In this article, we investigate the consequences of additional degradations in the video signal on the audio-visual recognition process. We degrade the images with noise, JPEG compression, and errors in the localization of the mouth regio...